Survey on deep learning with class imbalance


Anand et al. [78] explored the effects of class imbalance on the backpropagation algorithm in shallow neural networks in the 1990s. The authors show that in class imbalanced scenarios, the length of the minority class's gradient component is much smaller than the length of the majority class's gradient component. In other words, the majority class essentially dominates the net gradient that is responsible for updating the model's weights. This reduces the error of the majority group very quickly during early iterations, but often increases the error of the minority group and causes the network to get stuck in a slow convergence mode.
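To make this concrete, the following PyTorch sketch (our own illustration, not Anand et al.'s original experimental setup) measures each class's contribution to the net gradient of a small network on a 99:1 imbalanced sample:

```python
import torch
import torch.nn as nn

# Illustrative sketch: compare per-class gradient contributions on imbalanced data.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.BCEWithLogitsLoss(reduction="sum")

# 990 majority (label 0) samples vs 10 minority (label 1) samples.
X = torch.randn(1000, 20)
y = torch.cat([torch.zeros(990), torch.ones(10)])

for label in (0.0, 1.0):
    model.zero_grad()
    mask = y == label
    loss = loss_fn(model(X[mask]).squeeze(1), y[mask])
    loss.backward()
    grad_norm = torch.cat([p.grad.flatten() for p in model.parameters()]).norm()
    print(f"class {int(label)}: gradient norm = {grad_norm:.3f}")
# The majority class's gradient component is typically far larger,
# so it dominates the weight updates early in training.
```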

This section analyzes a number of deep learning methods for addressing class imbalance, organized by data-level, algorithm-level, and hybrid methods. For each surveyed work, we summarize the implementation details and the characteristics of the data sets used to evaluate the method. We then discuss various strengths and weaknesses, considering topics such as class imbalance levels, interpretation of results, relative performance, difficulty of use, and generalization to other architectures and problem domains. Known limitations are highlighted and suggestions for future work are offered. For consistency, class imbalance is presented as the maximum between-class ratio, \(\rho\) (Eq. 1), for all surveyed works.

Data-level methods

This section includes four papers that explore data-level methods for addressing class imbalance with DNNs. Hensman and Masko [79] first show that balancing the training data with ROS can improve the classification of imbalanced image data. Then RUS and augmentation methods are used by Lee et al. [20] to decrease class imbalance for the purpose of pre-training a deep CNN. Pouyanfar et al. [21] introduce a new dynamic sampling method that adjusts sampling rates according to class-wise performance. Finally, Buda et al. [23] compare RUS, ROS, and two-phase learning across multiple imbalanced image data sets.

Balancing training data with ROS

Hensman and Masko [79] explored the effects of class imbalance and ROS using deep CNNs. The CIFAR-10 [80] benchmark data set, comprised of 10 classes with 6000 images per class, was used to generate 10 imbalanced data sets for testing. These 10 generated data sets contained varying class sizes, ranging between 6% and 15% of the total data set, producing a max imbalance ratio \(\rho = 2.3\). In addition to varying the class size, the different distributions also varied the number of minority classes, where a minority class is any class smaller than the largest class. For example, a major 50–50 split (Dist. 3) reduced five of the classes to 6% of the data set size and increased five of the classes to 14%. As another example, a major singular over-representation (Dist. 5) increased the size of the airplane class to 14.5%, reducing the other nine classes slightly to 9.5%.

A variant of the AlexNet [17] CNN, which has proven to perform well on CIFAR-10, was used for all experiments by Hensman and Masko. The baseline performance was defined by training the CNN on all distributions with no data sampling. The ROS method being evaluated consisted of randomly duplicating samples from the minority classes until all classes in the training set had an equal number of samples.
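As a concrete reference, the NumPy sketch below implements this flavor of ROS, duplicating randomly chosen minority samples until every class matches the largest class (a generic sketch of the described procedure, not Hensman and Masko's code):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples at random until every class
    matches the size of the largest class (ROS to full balance)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        c_idx = np.flatnonzero(y == c)
        idx.append(c_idx)
        if n < target:  # sample the shortfall with replacement
            idx.append(rng.choice(c_idx, size=target - n, replace=True))
    idx = np.concatenate(idx)
    rng.shuffle(idx)
    return X[idx], y[idx]
```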

Hensman and Masko presented their results as the percentage of correct answers per class, and included the mean score for all classes, denoted by Total. To reduce the effects of random variation, each experiment was run three times and the results were averaged. Table 3 shows the results of the CNN without any data sampling. These results demonstrate the impact of class imbalance when training a CNN model. Most of the imbalanced distributions saw a loss in performance. Dist. 6 and Dist. 7, which contained very slight imbalance and no over-representation, performed just as well as the original balanced distribution. Some of the imbalanced distributions that contained over-represented classes, e.g. Dist. 5 and Dist. 9, yielded useless models that were completely biased towards the majority group.

Table 4 includes the results of training the CNN with balanced data that was generated through ROS. It shows that over-sampling performs significantly better than the baseline results from Table 3. Dist. 1 is excluded from Table 4 because it is already balanced, i.e. ROS is not applicable. In this experiment, ROS improved the classification results for all distributions. Dist. 5 and Dist. 9 saw the largest performance gains, increasing from baseline Total F1-scores of 0.10 up to 0.73 and 0.72, respectively. The ROS classification results for distributions Dist. 2–Dist. 11 are comparable to the results achieved by the baseline CNN on Dist. 1, suggesting that ROS has completely restored model performance.

Table 3 Imbalanced CIFAR-10 classification [79]
Table 4 Imbalanced CIFAR-10 classification with ROS [79]

The experiments by Hensman and Masko show that applying ROS to the level of class balance can be effective in addressing slight class imbalance in image data. It is also made clear by some of the results in Table 3 that even small levels of imbalance can prevent a CNN from converging to an acceptable solution. We believe that the low imbalance level tested (\(\rho = 2.3\)) is the biggest limitation of this experiment, as imbalance levels are typically much higher in practice. Besides exploring additional data sets and higher levels of imbalance, one area worth pursuing further is the total number of epochs completed during training on the imbalanced data. In these experiments, only 10 epochs over the training data were completed because the authors were more interested in comparing performance than in achieving high performance. Running additional epochs would help determine whether the poor performance was due to the slow convergence phenomenon described by Anand et al.

At the time of writing, May 2015, Hensman and Masko observed no existing research that examined the impact of class imbalance on deep learning with popular benchmark data sets. Our literature review also finds this to be true, confirming that addressing class imbalance with deep learning is still relatively immature and understudied.

Two-phase learning

Lee et al. [20] combined RUS with transfer learning to classify a highly imbalanced data set of plankton images, WHOI-Plankton [81]. The data set contains 3.4 million images spread over 103 classes, with 90% of the images belonging to just five classes and the 5th largest class making up just 1.3% of the entire data set. Imbalance ratios of \(\rho > 650\) are exhibited in the data set, with many classes making up less than 0.1% of the data set. The proposed method is a two-phase learning procedure, where a deep CNN is first pre-trained with thresholded data, and then fine-tuned using all data. The thresholded data sets for pre-training are constructed by randomly under-sampling large classes until they reach a threshold of N examples, where the authors selected \(N = 5000\) through preliminary experiments (the thresholding step is sketched after the method list below). The proposed model (G) was compared to six alternative methods (A–F), combinations of sampling, augmentation, and fine-tuning techniques, using unweighted average F1-scores to compare results.

(A)

Full: CNN trained with original imbalanced data set.

(B)

Noise: CNN trained with augmented data, where minority classes are duplicated through noise injection until all classes contain at least 1000 samples.

(C)

Aug: CNN trained with augmented data, where minority classes are duplicated through rotation, scaling, translation, and flipping of images until all classes contain at least 1000 samples.

(D)

Thresh: CNN trained with thresholded data, generated through random under-sampling until all classes have at most 5000 samples.

(E)

Noise + full: CNN pre-trained with the noise-augmented data from (B), then fine-tuned using the full data set.

(F)

Aug + full: CNN pre-trained with the transform-augmented data from (C), then fine-tuned using the full data set.

(G)

Thresh + full: CNN pre-trained using the thresholded data set from (D), then fine-tuned using the full imbalanced data set.
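For concreteness, here is a minimal NumPy sketch of the thresholded under-sampling used to construct the pre-training set in methods (D) and (G) (our illustration of the described procedure, not the authors' code):

```python
import numpy as np

def threshold_undersample(X, y, n_max=5000, seed=0):
    """Build the pre-training set for two-phase learning: randomly
    under-sample every class larger than n_max down to n_max samples."""
    rng = np.random.default_rng(seed)
    idx = []
    for c in np.unique(y):
        c_idx = np.flatnonzero(y == c)
        if len(c_idx) > n_max:
            c_idx = rng.choice(c_idx, size=n_max, replace=False)
        idx.append(c_idx)
    idx = np.concatenate(idx)
    rng.shuffle(idx)
    return X[idx], y[idx]

# Phase 1: pre-train the CNN on threshold_undersample(X, y, n_max=5000).
# Phase 2: fine-tune the same network on the full imbalanced (X, y).
```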

Results from Lee et al.'s experiments are presented in Table 5, where the L5 and Rest columns are the unweighted average F1-scores for the largest five classes and the remaining minority classes, respectively. Comparing methods (B–D) shows that under-sampling (D) outperforms both noise injection (B) and augmentation (C) over-sampling methods on the given data set for all classes. Methods (B–D) saw moderate improvements on the Rest group when compared to the baseline (A), but the L5 group suffered a substantial loss in performance. Re-training the model with the full data set in (E–G) allowed the models to re-capture the distribution of the L5 group. The proposed model (G) achieved the highest F1-score (0.7791) on the L5 group, and greatly improved the F1-score of the Rest group from 0.1548 to 0.3262 with respect to the baseline.

Table 5 Two-phase learning with WHOI-Plankton (Avg. F1-score) [20]

The two-phase learning procedure presented by Lee et al. has proven effective in increasing the minority class performance while still preserving the majority class performance. Unlike plain RUS, which completely removes potentially useful information from the training set, the two-phase learning method only removes samples from the majority group during the pre-training phase. This allows the minority group to contribute more to the gradient during pre-training, and still allows the model to see all of the available data during the fine-tuning phase. The authors did not include details on the pre-training phase, such as the number of pre-training epochs or the criteria used to determine when pre-training was complete. These details should be considered in future works, as pre-training that results in high bias or high variance will certainly impact the final model’s class-wise performance. Future work can also consider a hybrid approach, where the model is pre-trained with data that is generated through a combination of under-sampling majority classes and augmenting minority classes.

The previous year, Havaei et al. [82] used a similar two-phase learning procedure to manage class imbalance when performing brain tumor image segmentation. The brain tumor data contains minority classes that make up less than 1% of the total data set. Havaei et al. stated that the two-phase learning procedure was critical in dealing with the imbalanced distribution in their image data. The details of this paper are not included in this survey because the two-phase learning is just one small component of their domain-specific experiments.

Dynamic sampling

Pouyanfar et al. [21] used a dynamic sampling technique to perform classification of imbalanced image data with a deep CNN. The basic idea is to over-sample the low performing classes and under-sample the high performing classes, showing the model less of what it has already learned and more of what it does not yet understand. This is somewhat analogous to how humans learn, moving on from easy tasks once they are mastered and focusing attention on the more difficult tasks. The authors' self-collected data set contains over 10,000 images captured from publicly available network cameras, covering a total of 19 semantic concepts, e.g. intersection, forest, farm, sky, water, playground, and park. From the original data set, 70% is used for training models, 20% is used for validation, and 10% is set aside for testing. The authors report imbalance ratios in the data set as high as \(\rho = 500\). Average F1-scores and weighted average F1-scores are used to compare the proposed model to a baseline CNN (A) and four alternative methods for handling class imbalance (B–E).

The system presented by Pouyanfar et al. includes three core components: real time data augmentation, transfer learning, and a novel dynamic sampling method. Real time data augmentation improves generalization by applying various transformations to select images in each training batch. Transfer learning is achieved by fine-tuning an Inception-V3 network [83] that was pre-trained using ImageNet [84] data. The dynamic sampling method is the main contribution relative to class imbalance.

$$\begin{aligned} \textit{Sample size }(F1_{i}, c_{j}) = \frac{1 - f1_{i,j}}{\sum \nolimits _{c_{k}\in {C}}(1 - f1_{i,k})}\times N^{*} \end{aligned}$$ (11)

\(F1_{i}\) is a vector containing all individual class F1-scores after iteration i, and \(f1_{i,j}\) denotes the F1-score for class j on iteration i, where F1-scores are calculated for each class in a one-versus-all manner. During the next iteration, classes with lower F1-scores are sampled at a higher rate, forcing the learner to focus more on previously misclassified examples. Eq. 11 is used to obtain the next iteration's sample size for a given class \(c_{j}\), where \(N^{*}\) is the average class size. To prevent over-fitting of the minority group, a second model is trained through transfer learning without sampling. At inference time, the output label is computed as a function of both models.
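A direct implementation of Eq. 11 is straightforward; the sketch below (our illustration, not the authors' code) computes the next iteration's per-class sample sizes from the current class-wise F1-scores:

```python
import numpy as np

def next_sample_sizes(f1_scores, avg_class_size):
    """Eq. 11: allocate the next iteration's per-class sample sizes in
    proportion to each class's error rate (1 - F1), scaled by N*."""
    f1 = np.asarray(f1_scores, dtype=float)
    weights = 1.0 - f1  # poorly learned classes receive more weight
    sizes = weights / weights.sum() * avg_class_size
    return np.rint(sizes).astype(int)

# Example: three classes with F1-scores 0.9, 0.6, 0.3 after iteration i.
print(next_sample_sizes([0.9, 0.6, 0.3], avg_class_size=1200))
# -> [100 400 700]: the weakest class is sampled most heavily.
```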

The proposed model is compared with the following alternative methods:

(A)

Basic CNN: VGGNet [85] CNN trained on entire data set.

(B)

Deep CNN features + SVM: Support vector machine (SVM) classifier trained with deep features generated by CNN.

(C)

Transfer learning without augmentation: Fine-tuned Inception-V3 with no data augmentation.

(D)

Transfer learning with augmentation: Fine-tuned Inception-V3 with data augmentation.

(E)

Transfer learning with balanced augmentation: Fine-tuned Inception-V3 with data augmentation that enforces class balanced training batches with over-sampling and under-sampling.

(F)

Proposed model: Dynamic sampling, data augmentation, and transfer learning on Inception-V3 network.

Average class-wise F1-scores are compared across all 19 concepts, showing that the basic CNN performs the worst in all cases. The basic CNN was unable to classify a single instance correctly for several concepts with very high imbalance ratios, including Playground and Airport with imbalance ratios of \(\rho = 200\) and \(\rho = 500\), respectively. Transfer learning methods (C–E) performed significantly better than the baseline CNN, increasing the weighted average F1-score from 0.630 to as high as 0.779. Results in Table 6 show that the proposed method (F) outperforms all other models tested on the given data set. Compared to transfer learning with basic augmentation (D), the dynamic sampling method (F) improved the weighted average F1-score from 0.779 to 0.794.

Table 6 Dynamic sampling with network camera image data [21]

The dynamic sampling method’s ability to self-adjust sampling rates is its most attractive feature. This allows the method to adapt to different problems containing varying levels of complexity and class imbalance, with little to no hyperparameter tuning. By removing samples that have already been captured by the network parameters, gradient updates will be driven by the more difficult positive class samples. The dynamic sampling method outperforms a hybrid of over-sampling and under-sampling (E) according to F1-scores, but the details of the sampling method are not included in the description. We also do not know how dynamic sampling performs against plain RUS and ROS, as these methods were not tested. This should be examined closely in future works to determine if dynamic sampling can be used as a general replacement for RUS and ROS. One area of concern is the method’s dependency on a validation set to calculate the class-wise performance metrics that are required to determine sampling rates. This will certainly be problematic in cases of class rarity, where very few positive samples exist, as setting aside data for validation may deprive the model of valuable training data. Methods for maximizing the total available training data should be included in future research. In addition, future research should extend the dynamic sampling method to non-CNN architectures and other domains.

ROS, RUS, and two-phase learning

Buda et al. [23] compare ROS, RUS, and two-phase learning using three multi-class image data sets and deep CNNs. MNIST [86], CIFAR-10, and ImageNet data sets are used to create distributions with varying levels of imbalance. Both MNIST and CIFAR-10 training sets contain 50,000 images spread evenly across 10 classes, i.e. 5000 images per class. Imbalanced distributions were created from MNIST and CIFAR-10 in the range of \(\rho \in [10, 5000]\) and \(\rho \in [2,50]\), respectively. The ImageNet training data, containing 100 classes with a maximum of 1000 samples per class, was used to create imbalanced distributions in the range of \(\rho \in [10, 100]\).
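Buda et al. generate these distributions by sub-sampling the original balanced training sets. The sketch below illustrates one of their imbalance types, linear imbalance, where class sizes decrease linearly from the full class size down to a fraction \(1/\rho\) of it (a simplified sketch assuming integer labels; the paper also studies step-type imbalance):

```python
import numpy as np

def make_linear_imbalance(y, rho, seed=0):
    """Sub-sample a balanced data set (integer labels 0..K-1) so class
    sizes decrease linearly from the full size down to (full size) / rho.
    Returns the indices of the retained samples."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    n_full = np.max(np.bincount(y))
    targets = np.linspace(n_full, n_full / rho, num=len(classes)).astype(int)
    keep = []
    for c, t in zip(classes, targets):
        c_idx = np.flatnonzero(y == c)
        keep.append(rng.choice(c_idx, size=min(t, len(c_idx)), replace=False))
    return np.concatenate(keep)
```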

A different CNN architecture was empirically selected for each data set based on recent state-of-the-art results. For the MNIST and CIFAR-10 experiments, versions of the LeNet-5 [58] and the All-CNN [87] architectures were used for classification, respectively. Baseline results were established for each CNN architecture by performing classification on the data sets without any form of class imbalance technique, i.e. no data sampling or thresholding. Next, seven different methods for addressing class imbalance were integrated with the CNN architectures and tested. ROC AUC was extended to the multi-class problem by averaging the one-versus-all AUC for each class, and used to compare the methods. A portion of their results is presented in Fig. 3.

Fig. 3 ROS, RUS, and two-phase learning with MNIST (a–c) and CIFAR-10 (d–f) [23]

(A)

ROS: All minority classes were over-sampled until class balance was achieved, where any class smaller than the largest class size is considered a minority class. In almost all experiments, over-sampling displayed the best performance, and never showed a decrease in performance when compared to the baseline.

(B)

RUS: All majority classes were under-sampled until class balance was achieved, where any class larger than the smallest class size is considered a majority class. RUS performed poorly when compared to the baseline models, and never displayed a notable advantage over ROS. RUS was comparable to ROS only when the proportion of minority classes was very high, i.e. 80–90% of all classes.

(C)

Two-phase training with ROS: The model is first pre-trained on a balanced data set which is generated through ROS, and then fine-tuned using the complete data set. In general, this method performed worse than strict ROS (A).

(D)

Two-phase training with RUS: Similar to (C), except the balanced data set used for pre-training is generated through RUS. Results show that this approach was less effective than RUS (B).

(E)

Thresholding with prior class probabilities: The network’s decision threshold is adjusted during the test phase based on the prior probability of each class, effectively shifting the output class probabilities. Thresholding showed improvements to overall accuracy, especially when combined with ROS.

(F)

ROS and thresholding: The thresholding method (E) is applied after training the model with a balanced data set, where the balanced data set is generated through ROS. Thresholding combined with ROS performed better than the baseline thresholding (E) in most cases.

(G)

RUS and thresholding: The thresholding method (E) is applied after training the model with a balanced data set, where the balanced data set is produced through RUS. Thresholding with RUS performed worse than (E) and (F) in all cases.

The work by Buda et al., which varies levels of class imbalance and problem complexity, is comprehensive, having trained almost 23,000 deep CNNs across three popular data sets. The impact of class imbalance was validated on a set of baseline CNNs, showing that classification performance is severely compromised as imbalance increases and that the impact of class imbalance seems to increase as problem complexity increases, e.g. CIFAR-10 versus MNIST. The authors conclude that ROS is the best overall method for addressing class imbalance, that RUS generally performs poorly, and that two-phase learning with ROS or RUS is not as effective as their plain ROS and RUS counterparts.

While it does provide a reasonable high-level view of each method’s performance, the multi-class ROC AUC score provides no insight into the underlying class-wise performance trade-offs. It is not clear if there is high variance in the class-wise scores, or if one extremely low class-wise score is causing a large drop in the average AUC score. We believe that additional performance metrics, including class-wise scores, will better explain the effectiveness of each method in addressing class imbalance and help guide practitioners in model selection.

Buda et al. conclude that ROS should be performed until all class imbalance is eliminated. Despite their experimental results, we do not readily agree with this blanket statement, and argue that this is likely problem-dependent and requires further exploration. The MNIST data set, both relatively low in complexity and small in size, was used to demonstrate that over-sampling until all classes are balanced performs best. We do not know how well over-sampling to this level will perform on more complex data sets, or in problems containing big data or class rarity. Furthermore, over-sampling to this level of class balance in a big data problem can be extremely resource intensive, drastically increasing training time by introducing large volumes of redundant data.

Summary of data-level methods

Two of the surveyed works [23, 79] have shown that eliminating class imbalance in the training data with ROS significantly improves classification results. Lee et al. have shown that pre-training DNNs with semi-balanced data generated through RUS or augmentation-based over-sampling improves minority group performance. In contrast to Lee et al., Buda et al. found that plain ROS and RUS generally perform better than two-phase learning. Unlike Lee et al., however, Buda et al. pre-trained their networks with data that was sampled until class balance was achieved. Since Lee et al. and Buda et al. used different imbalance levels for pre-training, and reported results with different performance metrics, it is difficult to understand the efficacy of two-phase learning. The dynamic sampling method outperformed baseline CNNs, but we do not know how it compares to ROS, RUS, or two-phase learning. Despite this limitation, the dynamic sampling method’s ability to automatically adjust sampling rates throughout the training process is very appealing. Methods that can automatically adjust to varying levels of complexity and imbalance levels are favorable, as they reduce the number of tunable hyperparameters.

The experimental results suggest the use of ROS to eliminate class imbalance during the training of DNNs. This may be true for relatively small data sets, but we believe this will not hold true for problems containing big data or extreme class imbalance. Applying ROS until classes are balanced in very-large data sets, e.g. WHOI-Plankton data, will result in the duplication of large volumes of data and will drastically increase training times. RUS, on the other hand, reduces training time and may therefore be more practical in big data problems. We believe that RUS methods that remove redundant samples, reduce class noise, and strengthen class borders will prove helpful in these big data problems. Future work should explore these scenarios further.

All of the data-level methods presented were tested on class imbalanced image data with deep CNNs. In addition, differences in performance metrics and problem complexity make it difficult to compare methods directly. Future works should test these data-level methods on a variety of data types, imbalance levels, and DNN architectures. Multiple complementary performance metrics should be used to compare results, as this will better illustrate method trade-offs and guide future practitioners.

Algorithm-level methods

This section includes surveyed works that modify deep learning algorithms for the purpose of addressing class imbalance. These methods can be further divided into new loss functions, cost-sensitive learning, and threshold moving. Wang et al. [18] and Lin et al. [88] introduced new loss functions that allow the minority samples to contribute more to the loss. Wang et al. [89], Khan et al. [19], and Zhang et al. [90] experimented with cost-sensitive DNNs. The methods proposed by Khan et al. and Zhang et al. have the advantage of learning cost matrices during training. The work by Buda et al. in the "ROS, RUS, and two-phase learning" section has also been included in this section, as they experimented with threshold adjusting. Zhang et al. [91] combine transfer learning, CNN feature extraction, and a cluster-based nearest neighbor rule to improve the classification of imbalanced image data. Finally, Ding et al. [92] experiment with very-deep CNNs to determine if increasing neural network depth improves convergence rates with imbalanced data.

Mean false error (MFE) loss

Wang et al. [18] found some success in modifying the loss function as they experimented with classifying imbalanced data with deep MLPs. A total of eight imbalanced binary data sets, including three image data sets and five text data sets, were generated from the CIFAR-100 [93] and 20 Newsgroup [94] collections. The data sets are all relatively small, with most training sets containing fewer than 2000 samples and the largest training set containing just 3500 samples. For each data set generated, imbalance ratios ranging from \(\rho = 5\) to \(\rho = 20\) were tested.

The authors first show that the mean squared error (MSE) loss function poorly captures the errors from the minority group in cases of high class imbalance, due to many negative samples dominating the loss function. They then propose two new loss functions that are more sensitive to the errors from the minority class, mean false error (MFE) and mean squared false error (MSFE). The proposed loss functions were derived by first splitting the MSE loss into two components, mean false positive error (FPE) and mean false negative error (FNE). The FPE (Eq. 12) and FNE (Eq. 13) values are then combined to define the total system loss, MFE (Eq. 14), as the sum of the mean error from each class.

Wang et al. introduce the MSFE loss (Eq. 15) as an improvement to the MFE loss, asserting that it better captures errors from the positive class. The MSFE can be expanded into the form \(\frac{1}{2}((FPE + FNE)^{2} + (FPE - FNE)^{2})\), demonstrating how the optimization process minimizes the difference between FPE and FNE. The authors believe this improved version will better balance the error rates between the positive and negative classes.

$$\begin{aligned} FPE = \frac{1}{N} \sum \limits _{i=1}^{N} \sum _{n} \frac{1}{2} (d_{n}^{(i)} - y_{n}^{(i)})^{2} \end{aligned}$$ (12)

$$\begin{aligned} FNE = \frac{1}{P} \sum \limits _{i=1}^{P} \sum _{n} \frac{1}{2} (d_{n}^{(i)} - y_{n}^{(i)})^{2} \end{aligned}$$ (13)

$$\begin{aligned} MFE = FPE + FNE \end{aligned}$$ (14)

$$\begin{aligned} MSFE = FPE^{2} + FNE^{2} \end{aligned}$$ (15)
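Both losses are simple to implement. Below is a minimal PyTorch sketch of Eqs. 12–15 for a single-output binary network, where the inner sum over output units reduces to a per-sample squared error (our illustration, not the authors' code):

```python
import torch

def mfe_msfe(y_pred, y_true):
    """MFE (Eq. 14) and MSFE (Eq. 15). FPE (Eq. 12) averages the squared
    error over negative samples and FNE (Eq. 13) over positive samples,
    so each class contributes to the loss regardless of its size."""
    sq_err = 0.5 * (y_true - y_pred) ** 2
    fpe = sq_err[y_true == 0].mean()  # mean error on the negative class
    fne = sq_err[y_true == 1].mean()  # mean error on the positive class
    return fpe + fne, fpe ** 2 + fne ** 2

# Example: predictions on 8 negatives and 2 positives.
y_true = torch.tensor([0., 0., 0., 0., 0., 0., 0., 0., 1., 1.])
y_pred = torch.tensor([.1, .2, .1, .3, .2, .1, .2, .1, .4, .3])
mfe, msfe = mfe_msfe(y_pred, y_true)
```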

A deep MLP trained with the standard MSE loss is used as the baseline model. This same MLP architecture is then used to evaluate both the MFE and MSFE loss. Image classification results (Table 7) and text classification results (Table 8) show that the proposed models outperform the baseline in nearly all cases, with respect to F-measure and AUC scores.

Table 7 CIFAR-100 classification with MFE and MSFE [18]
Table 8 20 Newsgroup classification with MFE and MSFE [18]

Results show that the MFE and MSFE loss functions outperform MSE loss in almost all cases. Improvements over the baseline MSE loss are most apparent when class imbalance is greatest, i.e. imbalance levels of 5%. The MFE and MSFE performance gains are also more pronounced on the image data than on the text data. For example, the MSFE loss improved the classification of Household image data, increasing the F1-score from 0.1143 to 0.2353 when the class imbalance level was 5%.

Being relatively easy to implement and integrate into existing models is one of the biggest advantages of using custom loss functions for addressing class imbalance. Unlike data-level methods that increase the size of the training set, the loss function is less likely to increase training times. The loss functions should generalize to other domains with ease, but as seen in comparing the image and text performance results, performance gains will vary from problem to problem. Additional experiments should be performed to validate MFE and MSFE effectiveness, as it is currently unclear how many rounds were conducted per experiment and the F1-score and AUC gains over the baseline are only minor improvements. On the image data experiments, for example, the average AUC gain over the baseline is just 0.025, with a median AUC gain over the baseline of only 0.008.

Focal loss

Lin et al. [88] proposed a model that effectively addresses the extreme class imbalance commonly encountered in object detection problems, where positive foreground samples are heavily outnumbered by negative background samples. Two-stage and one-stage detectors are well-known methods for solving such problems, where the two-stage detectors typically achieve higher accuracy at the cost of increased computation time. Lin et al. set out to determine whether a fast one-stage detector was capable of achieving state-of-the-art results on par with current two-stage detectors. Through analysis of various two-stage detectors (e.g. R-CNN [95] and its successors) and one-stage detectors (e.g. SSD [96] and YOLO [97]), class imbalance was identified as the primary obstacle preventing one-stage detectors from achieving state-of-the-art performance. The overwhelming number of easily classified negative background candidates creates imbalance ratios commonly on the order of \(\rho = 1000\), causing the negative class to account for the majority of the system's loss.

To combat these extreme imbalances, Lin et al. presented the focal loss (FL) (Eq. 16), which re-shapes the cross entropy (CE) loss in order to reduce the impact that easily classified samples have on the loss. This is achieved by multiplying the CE loss by a modulating factor, \({\alpha _{t}(1 - p_{t})^{\gamma }}\). Hyperparameter \(\gamma \ge 0\) adjusts the rate at which easy examples are down-weighted, and \(\alpha _{t} \ge 0\) is a class-wise weight that is used to increase the importance of the minority class. Easily classified examples, where \({p_{t} \rightarrow 1}\), cause the modulating factor to approach 0 and reduce the sample's impact on the loss.

$$\begin{aligned} FL(p_{t}) = -\alpha _{t}(1 - p_{t})^{\gamma }\log (p_{t}) \end{aligned}$$ (16)
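The focal loss is also straightforward to implement; the following PyTorch sketch computes Eq. 16 for binary classification given predicted positive-class probabilities (a minimal illustration, not the RetinaNet implementation):

```python
import torch

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss (Eq. 16). p holds predicted probabilities of the
    positive class and y the 0/1 labels; gamma down-weights easily
    classified samples and alpha re-weights the rare positive class."""
    eps = 1e-7
    p = p.clamp(eps, 1 - eps)
    p_t = p * y + (1 - p) * (1 - y)              # probability of the true class
    alpha_t = alpha * y + (1 - alpha) * (1 - y)  # class-wise weight
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t)).mean()
```

Setting gamma to 0 recovers an alpha-weighted cross entropy, which matches the paper's description of the modulating factor.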

The proposed one-stage focal loss model, RetinaNet, is evaluated against several state-of-the-art one-stage and two-stage detectors. The RetinaNet model is composed of a backbone model and two subnetworks, where the backbone model is responsible for producing feature maps from the input image, and the two subnetworks then perform object classification and bounding box regression. The authors selected the feature pyramid network (FPN) [88], built on top of the ResNet architecture [98], as the backbone model and pre-trained it on ImageNet data. They found that the FPN CNN, which produces features at different scales, outperforms the plain ResNet in object detection. Two additional CNNs with separate parameters, the subnetworks, are then used to perform classification and bounding box regression. The proposed focal loss function is applied to the classification subnet, where the total loss is computed as the sum of the focal loss over all \(\approx 100,000\) candidates. The COCO [99] data set was used to evaluate the proposed model against its competitors.

The first attempt to train RetinaNet using the standard CE loss quickly fails and diverges due to the extreme imbalance. By initializing the last layer of the model such that the prior probability of detecting an object is \({\pi = 0.01}\), results improve significantly to an average precision (AP) of 30.2. Additional experiments are used to determine appropriate hyperparameters for the focal loss, selecting \({\gamma = 2.0}\) and \({\alpha = 0.25}\) for all remaining experiments.

Experiments by Lin et al. show that the RetinaNet, with the proposed focal loss, is able to outperform existing one-stage and two-stage object detectors. It outscores the runner-up one-stage detector (DSSD513 [100]) and the best two-stage detector (Faster R-CNN with TDM [101]) by 7.6-point and 4.0-point AP gains, respectively. When compared to several online hard example mining (OHEM) [102] methods, RetinaNet outscores the best method with an increase in AP from 32.8 to 36.0. Table 9 compares results between RetinaNet and seven state-of-the-art one-stage and two-stage detectors.

Table 9 RetinaNet (focal loss) on COCO [88]

Lin et al. provide additional information to illustrate the effectiveness of the focal loss method. In one experiment, they use a trained model to compute the focal loss over \(\approx 10^{7}\) negative images and \(\approx 10^{5}\) positive images. By plotting the cumulative distribution function for positive and negative samples, they show that as \(\gamma\) increases, more and more weight is placed onto a small subset of negative samples, i.e. the hard negatives. In fact, with \(\gamma = 2\), they show that most of the loss is derived from a very small fraction of samples, and that the focal loss is indeed reducing the impact of easily-classified negative samples on the loss. Run time statistics were included in the report to demonstrate that they were able to construct a fast one-stage detector capable of outperforming accurate two-stage detectors.

The new FL method lends itself not just to class imbalance problems, but also to hard sample problems. It addresses the primary minority class gradient issue defined by Anand et al. by preventing the majority group from dominating the loss and allowing the minority group to contribute more to the weight updates. Similar to the MFE and MSFE loss functions, advantages include being relatively easy to integrate into existing models and having minimal impact on training time. We believe that the FL method's ability to down-weight easily classified samples will allow it to generalize well to other domains. The authors compared FL directly to the CE loss, but we do not know how FL compares to other existing class imbalance methods. Future works should compare this loss function to alternative class imbalance methods across a variety of data sets and class imbalance levels.

Nemoto et al. [103] later used the focal loss in another image classification task, the automated detection of rare building changes, e.g. new construction. The airborne building images are annotated with the labels: no change, new construction, rebuilding, demolished, repaint roofs, and laying solar panel. The training data contains 203,358 images in total, where 200,000 comprise the negative class, i.e. no change. The repaint roofs and laying solar panel classes contain just 326 and 222 images, respectively, yielding class imbalance ratios as high as \(\rho = 900\).

The experiments by Nemoto et al. utilize the VGG-16 [85] CNN architecture, where the baseline CNN uses the standard CE loss. Images were augmented by rotation and inversion, creating 20,000 samples per class. Class-wise accuracy on the validation set was used to compare the focal loss to the CE loss. The first experiment uses the CE loss function and shows that the accuracy on the repaint roof and laying solar panel classes begins to deteriorate after 25,000 iterations, suggesting over-fitting. In the second experiment, focal loss was evaluated using the same VGG-16 architecture and the same image augmentation procedure, generating 20,000 images per class. Unlike the first experiment, however, only three classes were selected for training and validation: no change, repaint roof, and laying solar panel images. The value of the focal loss’s down-weighting parameter \(\gamma\) was varied in the range \(\gamma \in [0, 5]\) to better understand its impact.

Fig. 4 Focal loss and building change detection [103]

Figure 4 shows the training and validation loss and accuracy of the focal loss method. Nemoto et al. conclude that focal loss mitigates problems related to class imbalance and over-fitting by adjusting the per-class learning speed. By comparing the loss plots, it is clear that as \(\gamma\) increases, the total loss decreases faster and the model is slower to over-fit. It is not clear, however, if FL with \(\gamma > 0\) produces better classification results. First, the results from the first experiment cannot be compared to the results of the second experiment, because the total number of classes was reduced from six to three, significantly reducing the complexity of the classification problem. Second, the results for the no change and laying solar panel classes in Fig. 4 show that the highest accuracy is achieved when \(\gamma = 0\), where the laying solar panel class is the smallest class. The focal loss with \(\gamma = 0\) reduces to the standard cross entropy loss, suggesting that the CE loss outperforms the focal loss on two of the three classes in this experiment. To better understand the effectiveness of FL, future work should include a baseline with consistent training data, several alternative methods for addressing class imbalance, and additional performance metrics.

Cost-sensitive deep neural network (CSDNN)

Wang et al. [89] employed a cost-sensitive deep neural network (CSDNN) method to detect hospital readmissions, a class imbalanced problem where a small percentage of patients are readmitted to a hospital shortly after their original visit. Two data sets were provided by the Barnes-Jewish Hospital, containing patient records spanning from 2007 to 2011. The first data set, general hospital wards (GHW), contains vital signs, clinical processes, demographics, real-time bedside monitoring, and other electronic data sources. Of the 2565 records in GHW, 406 patients were readmitted within 30 days and 538 patients were readmitted within 60 days, producing imbalance ratios of \(\rho = 5.3\) and \(\rho = 3.8\), respectively. The second data set, the operating room pilot data (ORP), contains vital signs, pre-operation data, laboratory tests, medical history, and procedure details. There are 700 records in the ORP data set, with 157 readmissions within 1 year and 124 readmissions within 30 days. The imbalance ratio of the ORP data is unclear; the authors merely state that the ORP data is less imbalanced than the GHW data (\(\rho < 3.8\)). A variety of performance metrics, including ROC, AUC, accuracy, recall, precision, positive predictive value (PPV), and negative predictive value (NPV) are used for evaluation. The PPV and NPV metrics are equivalent to positive and negative class precision scores. The proposed CSDNN method was found to outperform existing hospital readmission prediction systems, and has been deployed at the Barnes-Jewish Hospital.

Instead of representing categorical values with traditional one-hot encoding, Wang et al. use a categorical feature embedding approach to create more meaningful representations. In addition, they employed a CNN for automatic feature extraction from time series data, i.e. the patient vital signs data. The extracted features and categorical embeddings are concatenated, forming a final input feature vector that is fed to a DNN for classification. The DNN was composed of two hidden layers, with 128 and 64 neurons per layer. The CE loss function was modified to incorporate a pre-defined cost matrix, forcing the network to minimize misclassification cost. For the GHW data set, false negative errors were given a cost 2× the cost of false positive errors. Similarly for the ORP data set, the false negative cost was set to 1.5× the cost of false positive errors.
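A minimal sketch of this idea, folding per-error-type costs into the binary CE loss, is shown below (our illustration using the GHW 2:1 cost ratio, not the deployed CSDNN code):

```python
import torch

def cost_sensitive_ce(p, y, fn_cost=2.0, fp_cost=1.0):
    """Binary cross entropy with misclassification costs: errors on true
    positives (missed readmissions) are weighted fn_cost, while errors on
    true negatives are weighted fp_cost (the GHW setup used a 2:1 ratio).
    p holds predicted readmission probabilities, y the 0/1 labels."""
    eps = 1e-7
    p = p.clamp(eps, 1 - eps)
    loss = -(fn_cost * y * torch.log(p) + fp_cost * (1 - y) * torch.log(1 - p))
    return loss.mean()
```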

Wang et al.'s CSDNN method is compared to five baseline classifiers, including three decision tree methods, a SVM, and an ANN. Only one of the baseline methods addresses class imbalance, using under-sampling with a Random Forest (RF) classifier. The proposed method outperformed all of the baseline classifiers across all performance metrics except for NPV. Because multiple complementary performance metrics were reported, we were able to observe the class-wise performance trade-offs. On the GHW data set, the CSDNN's AUC of 0.70 outscored the runner-up ANN classifier's AUC of 0.62. Unfortunately, however, it is not possible to determine if the cost-sensitive loss function is the cause of the improved performance, as there are several other factors that may have contributed to improvements, e.g. categorical feature embedding and time-series feature extraction. A baseline that includes the CNN feature extractor and the categorical embeddings is required to isolate the efficacy of the cost-sensitive loss function.

Incorporating the cost matrix into the CE loss is a minor implementation detail that should generalize well to other domains and architectures with minimal impact on training times. The process of identifying an ideal cost matrix is likely the biggest limitation. In traditional machine learning problems with relatively small data sets, models can be validated across a range of costs and the best cost matrix can be selected for the final model. When working with DNNs and large data sets, however, this process of searching for the best cost parameters can be very time consuming and even impractical. Including results from a range of cost matrices in future work will help to demonstrate the class-wise performance trade-offs that occur as costs vary.

Learning cost matrices with cost-sensitive CNN (CoSen)

Khan et al. [19] introduced an effective cost-sensitive deep learning procedure which jointly learns network weight parameters and class misclassification costs during training. The proposed method, CoSen CNN, is evaluated against six multi-class data sets with varying levels of imbalance: MNIST, CIFAR-100, Caltech-101 [104], MIT-67 [105], DIL [106], and MLC [107]. Class imbalance ratios of \(\rho = 10\) are tested for the MNIST, CIFAR-100, Caltech-101, and MIT-67 data sets. The DIL and MLC data sets have imbalance ratios of \(\rho = 13\) and \(\rho = 76\), respectively. The VGG-16, pre-trained on ImageNet data, is used as a feature extractor and the baseline CNN throughout the experiments.

The cost matrix that is learned by the CoSen CNN is used to modify the output of the VGG-16 CNN's last layer, giving higher importance to samples with higher cost. The authors present three modified loss functions, incorporating the learned cost parameters into the MSE loss, SVM hinge loss, and CE loss. The training process learns network weight parameters and misclassification cost parameters by keeping one fixed at a time and minimizing the loss with respect to the other during training. The cost matrix update depends on the current classification errors, the overall classification error, and the class-to-class separability (C2C). The C2C separability measures relationships between within-class sample distances and the size of class-separating boundaries.

The CoSen CNN is evaluated against the baseline CNN, multiple sampling methods, and multiple cost-sensitive methods. A two-layered neural network was used for the sampling classification methods, and a SVM and RF were used for the cost-sensitive methods. The SVM and RF baseline classifiers used the features that were extracted by the pre-trained VGG-16 CNN as input. The SOSR CNN is a cost-sensitive deep learning method that incorporates a fixed cost matrix into the loss function [108].

Overall accuracy was used to show that the proposed CNN outperforms all seven alternative techniques across all data sets. Table 10 shows that the CoSen CNN performed exceptionally well, outperforming the runner-up classifier by more than 5% on CIFAR-100, Caltech-101, and MIT-67. The second-best classifier varies between the SOSR CNN and the baseline CNN, depending on the data set. In all cases SMOTE outperformed RUS, and the hybrid sampling method SMOTE-RSB [109] outperformed SMOTE. Unfortunately, with accuracy being the only performance metric reported across all seven class imbalance methods, and accuracy being unreliable in cases of class imbalance, these results may be misleading.

Table 10 Cost-sensitive CoSen CNN results (accuracy) [19]

F1 and G-Mean scores were reported to show that the proposed CoSen CNN outperforms the baseline CNN on all data sets, e.g. increasing the F1-score from 0.389 to 0.416 on the Caltech-101 data set. Khan et al. also included sample network training times, showing that the added cost parameter training increases each training epoch by several seconds but has little to no impact on the inference step. In another experiment, the authors defined three fixed cost matrices using class representation, data separability, and classification errors to derive the costs, and showed that the dynamic cost matrix method outperforms all three cases.

It is interesting that the baseline CNN, with no class imbalance modifications, is a close runner-up to the CoSen CNN, outperforming the sampling methods, SVM, and RF classifiers in all cases. This does not imply that a deep CNN with no class imbalance modifications is better equipped to address class imbalance than traditional sampling methods. Rather, this demonstrates the power of re-using strong feature extractors that have been trained on large volumes of data, e.g. over 1.3 million images in this experiment.

As mentioned previously, one of the difficulties of cost-sensitive learning is choosing an appropriate cost matrix, often requiring a domain expert or a grid search procedure. The CoSen method proposed by Khan et al. removes this requirement, and offers more of an end-to-end deep learning framework capable of learning from class imbalanced data. The authors report results on a number of experiments across a variety of data sets, and the joint learning of network parameters and cost parameters appears to be an excellent candidate for learning from class imbalanced data. Future experiments should explore this cost-sensitive method’s ability to learn from problems containing alternative data types, big data, and class rarity.

Cost-sensitive DBN with differential evolution (CSDBN-DE)

Zhang et al. [90] set out to automatically learn costs for the purpose of cost-sensitive deep learning. More specifically, they use a differential evolution algorithm [110] to improve the cost matrix at each training iteration, and incorporate these learned costs into a DBN. The proposed cost-sensitive deep belief network with differential evolution (CSDBN-DE) is evaluated on 42 data sets from the Knowledge Extraction based on Evolutionary Learning (KEEL) [111] repository. The training data sets are structured data sets, ranging in size from 171 samples to 3339 samples and containing between 6 and 13 attributes each. Imbalance ratios range from \(\rho = 9.28\) up to \(\rho = 128.21\).

The DBN pre-training phase follows the original layer-wise training method first proposed by Hinton et al. [65]. The cost matrix is then incorporated into the output layer’s softmax probabilities during the fine-tuning phase. Cost matrices are first randomly initialized, then training set evaluation scores are used to select a new cost matrix for the next population. Mutation and cross-over operations are applied to evolve and generate the next population of cost matrices. Once training is complete, the best cost matrix is selected and applied to the output layer of the DBN, forming the final version of the cost-sensitive DBN to be used for inference.
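The search loop can be pictured as a standard differential evolution over cost vectors, where the fitness of a candidate cost matrix is obtained by training and scoring the cost-sensitive network with it. The sketch below is a generic simplification of that idea, with a hypothetical `fitness` callback; it is not Zhang et al.'s exact mutation and cross-over scheme:

```python
import numpy as np

def evolve_costs(fitness, pop_size=20, gens=50, f=0.8, cr=0.9, seed=0):
    """Simplified differential evolution over a 2-entry cost vector
    [false-negative cost, false-positive cost]. `fitness` (hypothetical)
    scores a candidate by training and evaluating the cost-sensitive
    network with it."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.5, 10.0, size=(pop_size, 2))
    scores = np.array([fitness(c) for c in pop])
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
            mutant = a + f * (b - c)            # mutation
            cross = rng.random(2) < cr          # crossover mask
            trial = np.where(cross, mutant, pop[i]).clip(0.1, None)
            s = fitness(trial)
            if s > scores[i]:                   # greedy selection
                pop[i], scores[i] = trial, s
    return pop[scores.argmax()]                 # best cost matrix found
```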

Accuracy and error rate are used to compare the proposed method to an extreme learning machine (ELM) [112]. The ELM network is a type of single-hidden-layer feedforward network that does not use backpropagation and is known to train thousands of times faster than networks trained with backpropagation. Experiments were repeated 20 times and results averaged. The proposed CSDBN-DE model outperformed the ELM network on 28 out of 42 data sets, i.e. 66% of the experiments.

Similar to the CoSen CNN [19], the biggest advantage of the CSDBN-DE is its ability to learn an effective cost matrix throughout training, as this is often a difficult and time-consuming process. Unfortunately, it is not very clear how well the CSDBN-DE handles class imbalanced data, because it is compared to a single baseline built on a completely different architecture. Also, the accuracy performance metric provides no real insight into the network’s true performance when class imbalance is present. The concept of incorporating an evolutionary algorithm to iteratively update a cost matrix shows promise, but more robust experiments are required to validate its ability to classify imbalanced data.

Output thresholding

In addition to the data sampling methods discussed in the "ROS, RUS, and two-phase learning" section, Buda et al. [23] experimented with adjusting CNN output thresholds to improve overall performance. They used the MNIST and CIFAR-10 data sets with class imbalance ratios in the ranges of \(\rho \in [1, 5000]\) and \(\rho \in [1, 50]\), respectively. Accuracy scores were used to compare thresholding with the baseline CNN, ROS, and RUS methods as described in the "ROS, RUS, and two-phase learning" section.

The authors applied thresholding by dividing the network outputs for each class by its estimated prior probability, effectively reducing the likelihood of misclassifying examples from the minority group. They also considered hybrid methods by combining threshold moving with RUS and ROS. The thresholding method outperformed the baseline CNN for all levels of class imbalance on the MNIST and CIFAR-10 data sets. Thresholding outperformed ROS and RUS for all levels of imbalance on the MNIST data, but on the CIFAR-10 data there were several instances where ROS outperformed thresholding. Thresholding combined with ROS performed especially well, outperforming all other methods in nearly all cases.
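This prior-based adjustment amounts to a one-line rescaling of the network's outputs, as the NumPy sketch below illustrates (our own example values):

```python
import numpy as np

def threshold_by_priors(probs, priors):
    """Divide each class's output probability by its estimated prior and
    re-normalize, reducing the bias toward frequently seen classes."""
    adjusted = probs / np.asarray(priors)
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Example: a minority class (prior 0.01) with raw score 0.3 beats a
# majority class (prior 0.99) with raw score 0.7 after adjustment.
print(threshold_by_priors(np.array([[0.7, 0.3]]), priors=[0.99, 0.01]).round(3))
# -> [[0.023 0.977]]
```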

As discussed in the "ROS, RUS, and two-phase learning" section, Buda et al. explored a variety of deep learning methods for addressing a wide range of class imbalance levels. They have shown that overall accuracy can be improved with threshold moving, and that it can be implemented relatively easily with prior class probabilities. Unfortunately, the accuracy score does not explain individual class-wise performances and trade-offs. Since thresholding is only applied during inference, it does not affect training times. This also means that thresholding does not impact weight tuning and therefore does not improve a model’s ability to discriminate between classes. Regardless, it is still an appropriate method for reducing majority class bias that can be quickly implemented on top of an already trained network to improve classification results.

Category centers

Zhang et al. [91] experimented with deep representation learning of class imbalanced image data from the CIFAR-10 and CIFAR-100 data sets. They present a method for addressing class imbalance, category centers (CC), that combines transfer learning, deep CNN feature extraction, and a nearest neighbor discriminator. Three class imbalanced distributions are created from each original data set through random under-sampling, i.e. Dist. A, Dist. B, and Dist. C. In Dist. A and Dist. B, half of the classes are reduced in size, creating imbalance levels of \(\rho = 10\) and \(\rho = 20\), respectively. In Dist. C, class reduction levels increase linearly across all classes with a max imbalance of \(\rho = 20\). For example, Dist. C for the CIFAR-100 data set contains 25 images in each of the first 10 classes, 75 images in each of the next 10 classes, then 125 images in each of the next 10 classes, etc. The proposed model is compared to both a baseline CNN and an over-sampling method. The mean precision performance metric is used to compare results.

Zhang et al. observe that similar images of the same class tend to cluster well in CNN deep feature space. They discuss the decision boundary that is created by the final layer of the CNN, the classification layer responsible for separating these deep feature clusters, stating that there is a greater chance for large errors in boundary placement when class imbalance is present. To avoid this boundary error in scenarios of class imbalance, they propose using high-level features extracted by the CNN to calculate each class’s centroid in deep feature space. These category centers in deep feature space can then be used to classify new images, by assigning new images to their nearest deep feature category center. The VGG-16 network, pre-trained on ImageNet data, is used throughout the experiments as the baseline CNN. The proposed method fine-tunes the CNN on the imbalanced distribution, maps all images to their corresponding deep feature representations, then calculates the centroid for each class. At test time the trained CNN is used to extract features from new images, and each new image is assigned to the class of its nearest category center in feature space, defined by the Euclidean distance between the features.  Zhang et al. claim that the category center is significantly more stable than the boundary generated by a CNN’s classification layer, but no evidence was provided to support this.
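The classification rule itself is compact: compute each class's centroid once from training-set features, then assign test samples to the nearest centroid. The NumPy sketch below illustrates this (our illustration; feature extraction from the fine-tuned VGG-16 is assumed to happen elsewhere):

```python
import numpy as np

def fit_category_centers(features, y):
    """Compute each class's centroid in deep feature space."""
    classes = np.unique(y)
    return classes, np.stack([features[y == c].mean(axis=0) for c in classes])

def predict_by_center(features, classes, centers):
    """Assign each sample to the class of its nearest category center,
    using Euclidean distance in deep feature space."""
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Fit centers on training-set features extracted by the fine-tuned CNN
# (last conv or last fully connected layer), then classify test images
# by their nearest center.
```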

Table 11 Category centers with CIFAR-10 data (AP) [91]

A portion of Zhang et al.'s results is displayed in Table 11. The authors tried using two different CNN layers for feature extraction, the last convolutional layer (B) and the last fully connected layer (C), to determine if one set of features performed better. Finally, over-sampling was applied to the baseline and the category centers methods to determine the impact of data sampling on CC (D–F). These results show that the CC methods (B, C) outperform the baseline CNN (A) on mean precision for all three imbalance scenarios. Results also show that the addition of over-sampling (D–F) led to improved results on the CIFAR-10 distributions, with the exception of Dist. C. For example, the CC method with over-sampling (E) increased mean precision on Dist. A from a baseline (A) of 0.779 to 0.830. CIFAR-100 results were similar, in the sense that the proposed method (B, C) outperformed the baseline (A) for all three distributions. Unlike the CIFAR-10 data, interestingly, over-sampling did not improve the results of the proposed method when classifying the CIFAR-100 distributions, but it did improve the baseline CNN.

We believe that the biggest limitation to the CC method is that it is highly dependent on the DNN’s ability to generate discriminative features that cluster well. If a large volume of class balanced, labelled training data is not available to pre-train the DNN, deep feature boundaries may not be strong enough to apply the CC method. The specifics of the over-sampling method used in methods (D–F) are unknown, so we do not know if over-sampling was applied to a level of class balance or some other ratio. In addition, we do not know if these results are the average of several rounds of experiments, or are just results from a single instance. With no balanced distribution results, an unclear over-sampling method, and only one performance metric, it is difficult to understand how well the proposed CC method does in learning from class imbalance.

Very-deep neural networks

Ding et al. [92] experimented with very-deep CNN architectures, e.g. 50 layers, to determine if deeper networks perform better on imbalanced data. The authors observe in the literature that the error surfaces of deeper networks have better qualities for training convergence than smaller networks [113, 114]. They claim that larger networks contain more local minima with good performance, making acceptable solutions easier to locate with gradient descent. The networks are tested on the problem of Facial Action Unit (FAU) recognition, i.e. the detection of basic facial actions as defined by the Facial Action Coding System [115].

The EmotioNet Challenge Track 1 data set [116] is used to compare methods. From the original data set of over one million images, 450,000 images were randomly selected for training and 40,000 for validation. The images in this data set contain 11 possible FAUs. Since an image can be positive for more than one FAU, the authors treated the classification problem as a set of 11 binary problems. Several FAUs are present in less than 1% of the data set, e.g. nose wrinkler, chin raiser, and upper lid raiser. The lip stretcher FAU is only present in 0.01% of the data, creating a max imbalance ratio of \(\rho = 10,000\). Six deep CNN architectures are compared, all of which include \(3\times 3\) convolution filter layers and are trained for 100 epochs:

(A) Handcraft: 6-layer CNN

(B) Plain34: 34-layer CNN

(C) Res10: 10-layer ResNet CNN

(D) Res18: 18-layer ResNet CNN

(E) Res34: 34-layer ResNet CNN

(F) Res50: 50-layer ResNet CNN.

F1-scores from the first experiment are presented in Table 12. The shallow network (A) was unable to learn the highly imbalanced classes, e.g. AU2–AU5, within the 100 training epochs. The non-residual 34-layer CNN (B) saw a large boost in performance compared to the shallow network (A), with the average F1-score increasing from 0.29 to 0.48. The 10-layer ResNet (C) achieved an equal F1-score of 0.48. The FAU with the largest imbalance, i.e. lip stretcher (AU20), received an F1-score of 0.0 in all experiments. There is no noticeable difference in F1-scores between the networks with 10 or more layers (B–F). The 34-layer ResNet (E) won first place at the first track of the EmotioNet Challenge in 2017 [116].

Table 12 Comparing very-deep CNNs on FAU recognition (F1-score) [92]

In a second experiment, Ding et al. compare the convergence rate of a very-deep 18-layer CNN to a shallower 6-layer CNN across 100 epochs. The experiment shows that the very-deep 18-layer CNN converges faster than the shallower network, as the error decreases at a faster rate. The authors suggest that the shallower network will continue to converge, provided additional epochs.

The experiments by Ding et al. show that additional hidden layers can increase the convergence rate on facial action recognition. Training very-deep neural networks comes at a cost, however, as it increases the total number of matrix operations and the memory footprint. Additional experiments are required to determine if this increased convergence rate is observed with alternative deep learning architectures and data sets. With the convergence rate in question, other methods that have been shown to impact convergence rates should be included in future studies, e.g. alternative optimization methods, dynamic learning rates, and weight initialization.

Summary of algorithm-level methods

This section included eight algorithm-level methods for addressing class imbalance with DNNs. The MFE and MSFE loss functions presented by Wang et al. outperform the standard MSE loss on several image and text data sets. Lin et al.’s focal loss, which down-weights easy-to-classify samples, was used to outperform several state-of-the-art one-stage and two-stage object detectors on the COCO data set. Nemoto et al. also explored focal loss for the purpose of detecting rare building changes, but the results were somewhat contradictory. A cost-sensitive method that incorporates pre-defined misclassification costs into the CE loss function was used by Wang et al. to predict hospital readmissions. Khan et al. proposed a cost-sensitive deep CNN (CoSen) that jointly learns network parameters and cost matrix parameters during training. Overall accuracy was used to show that the CoSen CNN outperforms a baseline CNN, multiple sampling methods, and multiple cost-sensitive methods consistently across six image data sets. Similarly, Zhang et al. combined a DBN with an evolutionary algorithm that searches for optimal misclassification costs, but results were again reported with the accuracy metric and are difficult to interpret. Buda et al. demonstrated how prior class probabilities can be used to adjust DNN output thresholds to improve overall accuracy in classifying image data. Zhang et al. presented category centers, a method that addresses class imbalance by combining transfer learning, deep CNN feature extraction, and a nearest deep feature cluster discrimination rule. Finally, Ding et al. experimented with very-deep neural networks (\(> 10\) layers) and showed that deeper networks may converge faster due to changes in the error surfaces that allow for faster optimization.

Unlike data sampling methods, the algorithm-level methods presented do not alter the training data and do not require any pre-processing steps. Compared to ROS, which was recommended by multiple authors in the "Data-level methods" section, algorithm-level methods are less likely to impact training times. This suggests that algorithm-level methods may be better equipped for big data problems. With the exception of defining misclassification costs, the algorithm-level methods require little to no tuning. Fortunately, two methods were presented for automatically learning cost parameters. Methods which are able to adapt to different problems with minimal tuning are preferred, as they can be quickly applied to new problems and do not require specific domain knowledge. The focal loss function and CoSen CNN demonstrate this flexibility, and we believe they will generalize well to many complex problem domains.

In general, there is a lack of research that appropriately compares deep learning algorithm-level methods to alternative class imbalance methods. This is due to poor choices in baseline models, insufficient performance metrics, and domain-specific experiments that fail to isolate the proposed class imbalance method. Similar to the surveyed deep learning data-level methods, most of the methods in this section were evaluated on image data with deep CNNs. We are most interested in understanding how deep learning methods for addressing class imbalance compare to each other and to traditional machine learning techniques for class imbalance. We believe that filling these research gaps and properly evaluating these methods will have a great impact on future deep learning applications.

Hybrid methods

The deep learning techniques for addressing class imbalance in this section combine algorithm-level and data-level methods. Huang et al. [22] use a novel loss function and sampling method to generate more discriminative representations in their Large Margin Local Embedding (LMLE) method. Ando and Huang [117] presented the first deep feature over-sampling method, Deep Over Sampling (DOS). Finally, class imbalance in large-scale image classification is addressed by Dong et al. [118] with a novel loss function and hard sample mining.

Large Margin Local Embedding (LMLE)

Huang et al. [22] proposed the LMLE method for learning more discriminative deep representations of imbalanced image data. The method is motivated by the observation that minority groups are sparse and typically contain high variability, allowing the local neighborhood of these minority samples to be easily invaded by samples of another class. By combining a new informed quintuplet sampling method with a new triple-header hinge loss function, deep feature representations that preserve same class locality and increase inter-class discrimination are learned from imbalanced image data. These deep feature representations, which form well-defined clusters, are then used to label new samples with a fast cluster-wise K-NN classification method. The proposed LMLE method is shown to achieve state-of-the-art results on the CelebA [119] data set, which contains high imbalance levels up to \(\rho = 49\).

The quintuplet sampling method selects an anchor and four additional samples based on inter-class and intra-class cluster distances. During training, each mini-batch selects an equal number of quintuplets from both the minority and majority classes. The five samples obtained by the quintuplet sampling method are fed to five identical CNNs, and their outputs are aggregated into a single result. The triple-header hinge loss is then used to compute the error and update the network parameters accordingly. This regularized loss function constrains the deep feature representation such that clusters collapse into small neighborhoods with appropriate margins between inter-class and intra-class clusters.
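To illustrate the general pattern of these margin constraints, the sketch below is a simplification rather than the authors’ triple-header hinge loss: the exact quintuplet selection rules and the third margin constraint from [22] are omitted, and the margin values are hypothetical.

```python
import torch.nn.functional as F

# Simplified margin constraints over quintuplet members. `anchor`,
# `pos_cluster` (same cluster), `pos_class` (same class, different cluster),
# and `neg` (different class) are embedding tensors produced by the CNN.

def hinge_margin(anchor, near, far, margin):
    # Penalize cases where `far` is not at least `margin` farther
    # from the anchor than `near`.
    d_near = F.pairwise_distance(anchor, near)
    d_far = F.pairwise_distance(anchor, far)
    return F.relu(margin + d_near - d_far).mean()

def margin_loss_sketch(anchor, pos_cluster, pos_class, neg, m1=0.2, m2=0.2):
    # Enforce: same-cluster neighbors sit closer than same-class samples
    # from other clusters, which in turn sit closer than other classes.
    return (hinge_margin(anchor, pos_cluster, pos_class, m1)
            + hinge_margin(anchor, pos_class, neg, m2))
```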

Table 13 LMLE CelebA facial attribute recognition (Balanced Accuracy) [22]

The CelebA data set contains facial images annotated with 40 attributes, with imbalance levels as high as \(\rho = 49\) (Bald vs. not Bald). A total of 160,000 images are used to train a CNN, and the learned deep representations are then fed to a modified K-NN classifier. The Balanced Accuracy (Eq. 9) metric is calculated for each facial attribute. The LMLE-kNN model is compared to three state-of-the-art models in the facial attribute recognition domain: Triplet-kNN [120], PANDA [121], and ANet [122]. Results in Table 13 show that LMLE-kNN performs as well as, or better than, the alternative methods for all facial attributes. Performance gains over alternative methods increased as levels of imbalance increased, e.g. LMLE increased balanced accuracy on the Bald attribute from 75 to 90 when compared to its runner-up. The authors achieved similar results when comparing the LMLE-kNN to 14 state-of-the-art models on the BSDS500 [123] edge detection data set.

Huang et al. were one of the first to study deep representation learning with class imbalanced data. The proposed LMLE method combines several powerful concepts, achieves state-of-the-art results on a popular image data benchmark, and shows potential for future works in other domains. The high-performance results do come with a cost, however, as the system is both complex and computationally expensive. The data must be pre-clustered before quintuplet sampling can take place, which may be difficult if a suitable feature extractor is not available for the given data set. The authors do suggest that the initial clustering is not critical, because the learning procedure can be used to re-cluster the data throughout training. Furthermore, each forward pass through the system requires computation from five CNNs, one for each of the quintuplet samples. We agree with the statement made by Ando and Huang [117] that adapting LMLE to new problems will require many task- and model-specific configurations. This will likely deter many practitioners from using the method when solving class imbalanced problems, as most will resort to simpler and faster alternatives. Results provided by Dong et al. in the "Class rectification loss (CRL) and hard sample mining" section will show that LMLE outperforms ROS, RUS, thresholding, and cost-sensitive learning on the CelebA data set.

Deep over-sampling (DOS)

Ando and Huang [117] introduced over-sampling to the deep feature space produced by CNNs in their DOS framework. The proposed method is extensively evaluated by generating imbalanced data sets from five popular image benchmark data sets, including MNIST, MNIST-back-rot, SVHN [124], CIFAR-10, and STL-10 [125]. Class-wise precision and recall, F1-scores, and AUC are used to evaluate the DOS framework against alternative methods.

The DOS framework consists of two simultaneous learning procedures, optimizing the lower layer and upper layer parameters separately. The lower layers are responsible for acquiring the embedding function, while the upper layers learn to discriminate between classes using the generated embeddings. In order to learn the embedding features, the CNN’s input is presented with both a class label and a set of deep feature targets, an in-class nearest neighbor cluster from deep feature space. Then the micro-cluster loss computes the distances between each of the deep feature targets and their mean, constraining the optimization of the lower layers to shift deep feature embeddings towards the class mean. Ando and Huang state that shifting target representations to their corresponding local mean will induce smaller in-class variance and strengthen class distinction in the learned representations. The upper layers used for discriminating between classes are trained by taking the weighted sum of the CE losses, i.e. the CE loss for each deep feature target.

The deep over-sampling component is the process of selecting k in-class neighbors from deep feature space. In order to address class imbalance, the number of in-class neighbors to select should vary between classes. For example, using \(k = 3\) for the minority class, and \(k = 0\) for the majority class, will supplement the minority class with additional embeddings while leaving the majority class as is.
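A minimal sketch of this class-dependent neighbor selection follows, assuming `deep_feats` holds the embeddings from the CNN’s lower layers; the per-class k values and helper names are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def in_class_neighbors(deep_feats, labels, k_per_class):
    """Select each sample's k in-class nearest neighbors in deep feature
    space, where k varies by class (larger k supplements minority classes)."""
    targets = {}
    for c, k in k_per_class.items():
        idx = np.flatnonzero(labels == c)
        if k == 0 or len(idx) < 2:
            # No supplementation: each sample's micro-cluster is itself.
            targets.update({i: np.array([i]) for i in idx})
            continue
        k_eff = min(k, len(idx) - 1)
        nn = NearestNeighbors(n_neighbors=k_eff + 1).fit(deep_feats[idx])
        _, nbrs = nn.kneighbors(deep_feats[idx])
        # Column 0 is the sample itself; keep the k_eff true neighbors.
        targets.update({i: idx[row[1:]] for i, row in zip(idx, nbrs)})
    return targets

# Example: supplement the minority class (1) with 3 neighbors, leave the
# majority class (0) as is. The micro-cluster loss then pulls each
# embedding toward deep_feats[targets[i]].mean(axis=0).
# targets = in_class_neighbors(feats, y, {0: 0, 1: 3})
```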

In the first DOS experiment, they compare the DOS framework with two alternative methods for handling class imbalance, Large Margin Local Embedding (LMLE) [22] and Triplet re-sampling with cost-sensitive learning (TL-RS-CSL). MNIST back-rotation images were used to create three data sets with class reduction rates of 0%, 20%, and 40%. It was shown that the DOS framework outperformed TL-RS-CSL and LMLE for reduction rates of 20% and 40%, and that DOS performance deteriorates slower than both TL-RS-CSL and LMLE as imbalance levels increase. For example, on imbalanced MNIST-back-rot data (40% reduction), DOS scored an average class-wise recall of 75.43, compared to the LMLE score of 70.13.

Table 14 DOS with varying imbalance [117]

Table 15 DOS with varying k [117]

In a second experiment, the DOS framework is compared to a basic CNN and a K-NN classifier that is trained with CNN deep features (CNN-CL). Three different data sets are used to generate imbalanced data sets with reduction rates of 90%, 95%, and 99%, producing imbalance ratios up to \(\rho = 100\). The results in Table 14 show again that the improved performance of the DOS framework becomes more apparent as the level of class imbalance increases. The third level of imbalance, with reduction rates of 99%, show that the DOS framework clearly outperforms both the CNN and CNN-CL across all data sets and all performance metrics.

The final experiment uses balanced data sets to examine the sensitivity of DOS parameter k, which defines the number of deep feature in-class neighbors to sample. Table 15 shows that when k is increased to 10 the performance on the minority group begins to deteriorate. These results also show that the proposed DOS framework outperforms the baseline CNN and CNN-CL on balanced data sets.

The DOS method displayed no visible performance trade-offs between the majority and minority groups, or between precision and recall. Its ability to outperform alternative methods when class imbalance is not present is also very appealing, as this quality is not common among class imbalance learning methods. Most importantly, some of the performance gains observed when comparing the CNN or CNN-CL to the DOS model on the minority group were very large, e.g. MNIST-back-rot and SVHN F1-scores increasing from 0.43 to 0.66 and 0.42 to 0.64, respectively, under an imbalance reduction rate of 99%. Like other hybrid methods, DOS is more complex than common data-level and algorithm-level methods, which may discourage practitioners from using it. We agree with Ando and Huang, however, that the DOS method is generally extendable to other domains and deep learning architectures. Future works should evaluate the DOS method in these new contexts and should compare results to other class imbalance methods presented throughout this survey.

Class rectification loss (CRL) and hard sample mining

Dong et al. [118] present an end-to-end deep learning method for addressing high class imbalance in large-scale image classification. They explicitly distinguish their work from more traditional small-scale class imbalance problems containing small levels of imbalance, suggesting that this method may not generalize to small-scale problems. Experimental results compare the proposed method against data sampling, cost-sensitive learning, threshold moving, and several relevant state-of-the-art methods. Mean sensitivity scores are used as the primary performance metric by averaging together class-wise recall scores. The proposed method combines hard sample mining from the minority groups with a regularized objective function, the class rectification loss (CRL).

The hard sample mining selects minority samples which are expected to be more informative for each mini-batch, allowing the model to learn more effectively with less data. Dong et al. strive to rectify, or correct, the class distribution bias in an incremental manner, resulting in progressively stronger minority class discrimination through training. Unlike LMLE [22], which attempts to strengthen the structure of both majority and minority groups, this method only enhances minority class discrimination. To identify the minority classes, all classes are sorted by size and then selected from smallest to largest until they collectively sum to at most half the size of the data set, ensuring that the minority classes account for at most half the training batch. Both class-level and instance-level hardness metrics are considered when performing hard sample mining. At the class level, hard positives refer to weak recognitions, the correctly labelled samples with low prediction scores; hard negatives are the obvious mistakes, the samples incorrectly labelled with high prediction scores. At the instance level, hard positives are correctly labelled images that are far away in feature space, while hard negatives are incorrectly labelled images that are close by in feature space.
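The following is a minimal sketch of the class-level mining criteria just described, assuming softmax prediction scores are available for a candidate pool; the selection size `n_hard` is an illustrative parameter rather than a value from the paper.

```python
import numpy as np

def mine_hard_samples(probs, y_true, y_pred, n_hard=16):
    """Class-level hard sample mining: hard positives are correct
    predictions with low scores, hard negatives are wrong predictions
    with high scores."""
    conf = probs[np.arange(len(y_true)), y_pred]  # score of predicted label
    correct = y_pred == y_true
    # Hard positives: correctly labelled, lowest prediction scores first.
    pos_idx = np.flatnonzero(correct)
    hard_pos = pos_idx[np.argsort(conf[correct])][:n_hard]
    # Hard negatives: incorrectly labelled, highest prediction scores first.
    neg_idx = np.flatnonzero(~correct)
    hard_neg = neg_idx[np.argsort(-conf[~correct])][:n_hard]
    return hard_pos, hard_neg
```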

The CRL loss function (Eq. 17), where \(\alpha = \eta \Omega _{imb}\) is linearly proportional to the level of class imbalance, places a batch-wise class balancing constraint on the optimization process. This reduces the learning bias caused by the majority group’s over-representation. The CRL regularization imposes an imbalance-adaptive learning mechanism, applying more weight to the more highly imbalanced labels, while reducing the weight for the less imbalanced labels. Three different loss criteria \(\mathcal {L}_{crl}\) are explored through experimentation, where the Triplet Ranking loss [126] is selected as the default for many of the experiments.

$$\begin{aligned} \mathcal {L}_{bln} = \alpha \mathcal {L}_{crl} + (1 - \alpha )\mathcal {L}_{ce},~~~ \alpha = \eta \Omega _{imb} \end{aligned}$$ (17)
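As a hedged sketch of how Eq. 17 might appear in a training loop, the snippet below uses the Triplet Ranking loss for \(\mathcal {L}_{crl}\), consistent with the paper’s default; the margin and the values of \(\eta\) and \(\Omega _{imb}\) are placeholders supplied by the caller, not values from the paper.

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=0.5)  # margin value is illustrative

def crl_batch_loss(logits, targets, anchor, positive, negative, eta, omega_imb):
    """L_bln = alpha * L_crl + (1 - alpha) * L_ce, with alpha = eta * Omega_imb.
    The triplet terms come from mined hard samples in the mini-batch."""
    alpha = eta * omega_imb  # weight scales with the level of imbalance
    l_crl = triplet(anchor, positive, negative)
    l_ce = ce(logits, targets)
    return alpha * l_crl + (1.0 - alpha) * l_ce
```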

Experiments are conducted by extending state-of-the-art CNN architectures with the proposed method and performing classification on three benchmark data sets. The CelebA data set contains a max imbalance level of \(\rho = 49\). The X-Domain [127] data set contains 245,467 retail store clothing images that are annotated with 9 multi-class attribute labels, 165,467 of which are set aside for training. The X-Domain contains extreme class imbalance ratios of \(\rho > 4000\). Several imbalanced data sets are generated from the CIFAR-100 set, with imbalance ratios up to \(\rho = 20\), to demonstrate how CRL handles increasing levels of imbalance.

Table 16 CRL with CelebA Data (Balanced Accuracy) [118]

Table 16 contains the CelebA facial attribute recognition results. All class imbalance methods presented were implemented on top of the 5-layer CNN DeepID2 [128]. Referring to the bottom half of the data attributes, with imbalance ratios \(\rho \ge 6\), under-sampling almost always performs worse than over-sampling, cost-sensitive learning, and threshold moving. According to the mean sensitivity scores, over-sampling and cost-sensitive learning perform the same, outperforming thresholding and under-sampling by 3% and 4%, respectively.

When compared to the best non-imbalanced learning method (DeepID2) and the best imbalanced learning method (LMLE), the proposed CRL method improves the mean sensitivity by 6% and 3%, respectively. By analyzing the attribute-level scores, we can see that LMLE outperforms CRL in many cases where class imbalance levels are low, e.g. outperforming CRL on the attractive attribute by 7%. Dong et al. acknowledge that LMLE appears to handle low-level imbalance better than CRL. For many of the high-imbalance attributes, the non-imbalance method DeepID2 outperforms the LMLE method. It is when class imbalance is highest that the proposed CRL method performs its best and achieves its most significant performance gains over alternative methods.

In another experiment with the X-Domain data set, CRL outperforms LMLE on all categories, achieving a mean sensitivity performance gain over LMLE of almost 5%. Cost-sensitive learning, thresholding, and ROS all score roughly 6% below CRL, while RUS again performs very poorly. When experimenting with balanced distributions from the CIFAR-100 data set, the CRL method is applied to three state-of-the-art CNN models: CifarNet [129], ResNet32 [98], and DenseNet [130]. For each model, the CRL method consistently improves upon the baseline, increasing mean sensitivity by 3.6%, 1.2%, and 0.8%, respectively. Additional cost analysis by Dong et al. shows that CRL is significantly faster than LMLE, taking just 27.2 h to train versus LMLE’s 166.6 h on the given test case.

The advantages of CRL over alternative methods on large-scale class imbalanced image data were demonstrated through comprehensive experiments. It was shown to efficiently outperform many popular class imbalanced methods, including ROS, RUS, thresholding, cost-sensitive learning, and LMLE. It was also shown to train significantly faster than the LMLE method. The authors were clear to distinguish this method as effective in large-scale image data containing high levels of imbalance, and results on facial attribute detection showed that LMLE performed better for many facial attributes with lower levels of imbalance. This suggests that the CRL method may not be a good fit for class imbalance problems containing low levels of class imbalance. Despite this initial observation, CRL should be evaluated on a wide range of data sets with varying levels of complexity to better understand when it should be used. Exploring the CRL method’s ability to learn from non-image data is also of interest.

Summary of hybrid methods

This section analyzed three hybrid deep learning methods for addressing class imbalance. The LMLE method proposed by Huang et al. outperformed several state-of-the-art models on the CelebA and BSDS500 image data sets. Ando and Huang introduced the DOS method, which learns an embedding layer that produces more discriminative features, and then supplements the minority group by over-sampling in deep feature space. DOS was shown to outperform LMLE and baseline CNNs on multiple class imbalanced image data sets. Lastly, Dong et al.’s CRL loss function was shown to outperform ROS, RUS, cost-sensitive learning, output thresholding, LMLE, and several state-of-the-art models on the CelebA data set.

The results provided by Dong et al. are most informative, as they compare their CRL method to LMLE and several other data-level and algorithm-level deep learning methods for addressing class imbalance. Not only did they show that the CRL method outperforms alternative methods as class imbalance levels increase, but they also illustrated the effectiveness of RUS, ROS, thresholding, and cost-sensitive learning. From their results on the CelebA data set, we observe that ROS and cost-sensitive learning outperform RUS and threshold moving on average. These results are consistent with those from the "Data-level methods" section, suggesting that ROS is generally a good choice in addressing class imbalance with DNNs. Dong et al. also confirmed our original observation, that LMLE is rather complex and resource intensive, by showing that CRL is capable of training 6× faster than LMLE. Comparing DOS and CRL directly will prove useful, as both methods were shown to outperform the LMLE method.

In general, hybrid methods for learning from class imbalanced data will be more complex and more difficult to implement than algorithm-level and data-level methods. This is expected, since both algorithm-level and data-level methods are being combined for the purpose of improving classification performance. As learners become more complex, their flexibility and ease of use decrease, which may make them more difficult to adapt to new problems.

Class-wise performance scores on the CelebA data set in Table 16 show that different methods perform better when subjected to different levels of class imbalance. This observation supports our demand for future research that evaluates multiple deep learning methods across a variety of class imbalance levels and problem complexities. We expect to see similar trade-offs between models when they are evaluated across a variety of data types and DNN architectures. Filling these gaps in current research will help to shape future deep learning applications that involve class imbalance, class rarity, and big data.


